Load the dataset

The dataset contains information about houses on sale in the Dublin area. Each house is one entry of the dataset: a mixed-type record comprising numerical, categorical and textual data.

The goal is to combine numerical/categorical features with textual features to predict whether the house price is above or below 550,000.

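The house price is determined by factors such as location (area), surface (size), the number of bedrooms, the number of bathrooms, property type, and house-features (size of the windows, construction material).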

The physical attributes of the house, such as the number of bedrooms, the number of bathrooms, the surface, the property type and the location, are directly accessible from the dataset. In contrast, the house-features can (sometimes only indirectly) be inferred from the house-description, house-facility and house-features. You can download the dataset from this URL: https://github.com/benavoli/ST8003/tree/main/session05 A typical entry of the dataset is shown below.

data <- read.csv(file = '../session5/train.csv', sep = ",")
# Add a column that is 1 when price is above 550000 and 0 otherwise
data$pricerange <- as.numeric(data$price > 550000)
data[1, ]  # show the first entry

Data Cleaning, Covariate selection and preprocessing

We select some of the columns (‘bathrooms’, ‘beds’, ‘surface’) to use as predictors of pricerange.

datasel <- data[c('bathrooms', 'beds', 'surface', 'pricerange')]
datasel <- na.omit(datasel)  # remove every row that contains a missing value (NA)
datasel
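Because na.omit drops rows silently, it can be worth checking how many entries are lost and which columns are responsible. A minimal sketch on a small made-up data frame (the values in `toy` are invented for illustration and are not taken from train.csv):

```r
# Toy data frame standing in for the selected columns (values are made up)
toy <- data.frame(
  bathrooms  = c(1, 2, NA, 3, 2),
  beds       = c(2, 3, 3, NA, 4),
  surface    = c(70, 95, 110, 150, NA),
  pricerange = c(0, 0, 1, 1, 1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows kept and dropped by na.omit
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)
cat("rows kept:", nrow(toy_clean), " rows dropped:", rows_dropped, "\n")
```

Applying colSums(is.na(...)) to data[c('bathrooms', 'beds', 'surface', 'pricerange')] would give the same diagnosis for the real training set.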

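Logistic regression

We now fit a logistic regression (a generalized linear model with a binomial family), since the response pricerange takes only the values 0 and 1.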

model = glm(pricerange ~ bathrooms + beds + surface,  family = "binomial", data = datasel)
Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)

Call:
glm(formula = pricerange ~ bathrooms + beds + surface, family = "binomial", 
    data = datasel)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.2076  -0.6062  -0.3426   0.5130   3.3574  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.646e+00  2.300e-01 -24.544  < 2e-16 ***
bathrooms    3.693e-01  6.577e-02   5.615 1.97e-08 ***
beds         1.223e+00  7.267e-02  16.824  < 2e-16 ***
surface      6.822e-05  2.379e-05   2.867  0.00415 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2889.9  on 2401  degrees of freedom
Residual deviance: 2022.6  on 2398  degrees of freedom
AIC: 2030.6

Number of Fisher Scoring iterations: 5
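
The coefficients of a binomial glm are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below copies the estimates from the summary above into a vector (on the fitted object, exp(coef(model)) should give the same numbers) and checks how the reported AIC and deviances relate to each other. The variable names are introduced here for illustration only.

```r
# Estimates copied from the summary output above (log-odds scale)
estimates <- c(bathrooms = 3.693e-01, beds = 1.223e+00, surface = 6.822e-05)

# Odds ratio for a one-unit increase in each predictor, the others held fixed
odds_ratios <- exp(estimates)
print(round(odds_ratios, 3))

# For a 0/1 response, AIC = residual deviance + 2 * (number of coefficients)
aic_check <- 2022.6 + 2 * 4
print(aic_check)

# Drop in deviance relative to the intercept-only model (3 extra parameters)
deviance_drop <- 2889.9 - 2022.6
p_value <- pchisq(deviance_drop, df = 3, lower.tail = FALSE)
print(c(deviance_drop = deviance_drop, p_value = p_value))
```

With these estimates, each additional bedroom multiplies the odds of a price above 550000 by about 3.4 and each additional bathroom by about 1.45, the other predictors being held fixed. The surface coefficient is small because it refers to a single unit of surface.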

Is this a good model? Can we use other columns in data to improve it? Can we include polynomial and/or interaction terms? Use the model selection approaches you learned in sessions 5 and 6 to find a better model.
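
One possible way to approach these questions is to compare candidate formulas by AIC and by cross-validated accuracy. The sketch below is an illustration on simulated data, not a reference solution: the data-generating process, the candidate formulas and the helper `cv_accuracy` are all assumptions introduced here. Only base R functions (glm, AIC, predict, poly) are used.

```r
set.seed(1)

# Simulated stand-in for datasel (the real data would be used instead)
n <- 600
sim <- data.frame(
  bathrooms = sample(1:4, n, replace = TRUE),
  beds      = sample(1:6, n, replace = TRUE),
  surface   = runif(n, 40, 300)
)
lin <- -6 + 0.4 * sim$bathrooms + 0.9 * sim$beds + 0.01 * sim$surface
sim$pricerange <- rbinom(n, 1, plogis(lin))

# Candidate models: main effects, an interaction, and a quadratic term
candidates <- list(
  main        = pricerange ~ bathrooms + beds + surface,
  interaction = pricerange ~ bathrooms * beds + surface,
  quadratic   = pricerange ~ bathrooms + beds + poly(surface, 2)
)

# Hypothetical helper: k-fold cross-validated accuracy at threshold 0.5
cv_accuracy <- function(formula, data, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  acc <- numeric(k)
  for (i in 1:k) {
    fit <- glm(formula, family = "binomial", data = data[folds != i, ])
    p <- predict(fit, data[folds == i, ], type = "response")
    acc[i] <- mean((p > 0.5) == (data$pricerange[folds == i] == 1))
  }
  mean(acc)
}

results <- data.frame(
  model = names(candidates),
  AIC   = sapply(candidates, function(f) AIC(glm(f, family = "binomial", data = sim))),
  cv_accuracy = sapply(candidates, cv_accuracy, data = sim)
)
print(results)
```

A lower AIC and a higher cross-validated accuracy both point towards a better model, although the two criteria need not agree. Since the competition is scored on accuracy, the cross-validated accuracy is the more direct guide.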

Unseen data

You can test the predictive performance of your best model on unseen data.

datatest <- read.csv(file = '../session5/test.csv', sep = ",")
# Rows 10 to 28 of the test set. There are 16 columns; the first two are just ids.
# The price column is not reported: you have to predict the price range for all the entries.
datatest[10:28, 3:16]

Prediction

predictions <- predict(model,datatest, type="response")
predictions[1:5]
        1         2         3         4         5 
0.4980193 0.2257686 0.5898662 0.3792981 0.1675038 

These are the predicted probabilities that pricerange is 1 for the first 5 houses in the test set. You can save and submit your best predictions for our internal data science competition. The following code uses the threshold 0.5 to predict 1, that is, a house price above 550000.

# TRUE when the predicted probability of a price above 550000 exceeds 0.5
write.csv(as.array(predictions > 0.5), "name_surname.csv")
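
One caveat: the training rows with missing values were removed with na.omit, but the test set may also contain missing predictors (an assumption worth checking with colSums(is.na(datatest))). In that case predict returns NA for those rows, while a prediction is required for every entry. One simple option among several is median imputation, sketched here on simulated data. The data frames `train_toy` and `test_toy` and all their values are made up for illustration.

```r
set.seed(2)

# Simulated training data and a tiny test set with missing values (made up)
n <- 200
train_toy <- data.frame(beds = sample(1:5, n, replace = TRUE),
                        surface = runif(n, 40, 250))
train_toy$pricerange <- rbinom(
  n, 1, plogis(-5 + 0.8 * train_toy$beds + 0.015 * train_toy$surface)
)
test_toy <- data.frame(beds = c(2, NA, 4), surface = c(80, 100, NA))

fit <- glm(pricerange ~ beds + surface, family = "binomial", data = train_toy)

# Rows with a missing predictor get an NA prediction
p_raw <- predict(fit, test_toy, type = "response")
print(p_raw)

# Replace each missing value by the median of that column in the training data
test_filled <- test_toy
for (col in c("beds", "surface")) {
  missing <- is.na(test_filled[[col]])
  test_filled[[col]][missing] <- median(train_toy[[col]], na.rm = TRUE)
}

p_filled <- predict(fit, test_filled, type = "response")
labels <- as.integer(p_filled > 0.5)  # 1 = predicted above 550000
print(labels)
```

The medians are taken from the training data so that the test set is treated purely as unseen data.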

We will use the accuracy score, that is, the fraction of entries classified correctly, to evaluate your predictions.
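
Since the test prices are not available, accuracy can only be estimated locally, for example by holding out part of the training data. The following is a rough sketch on simulated data: the data frame `sim`, the 80/20 split and the variable names are assumptions made for illustration. It first thresholds the five probabilities printed above to show what the submitted labels look like.

```r
# The five predicted probabilities shown above, thresholded at 0.5
p_shown <- c(0.4980193, 0.2257686, 0.5898662, 0.3792981, 0.1675038)
print(as.integer(p_shown > 0.5))  # 0 0 1 0 0

set.seed(3)

# Simulated stand-in for the training data (made up)
n <- 500
sim <- data.frame(beds = sample(1:6, n, replace = TRUE),
                  surface = runif(n, 40, 300))
sim$pricerange <- rbinom(n, 1, plogis(-5 + 0.9 * sim$beds + 0.01 * sim$surface))

# Hold out 20% of the rows as a validation set
val_idx <- sample(n, size = n / 5)
fit <- glm(pricerange ~ beds + surface, family = "binomial", data = sim[-val_idx, ])

# Accuracy = fraction of validation rows whose predicted label matches the truth
p_val <- predict(fit, sim[val_idx, ], type = "response")
pred_label <- as.integer(p_val > 0.5)
accuracy <- mean(pred_label == sim$pricerange[val_idx])

# Baseline: always predict the more frequent class in the validation set
share_above <- mean(sim$pricerange[val_idx])
baseline <- max(share_above, 1 - share_above)
print(c(accuracy = accuracy, baseline = baseline))

# Confusion table of predicted versus actual labels
print(table(predicted = pred_label, actual = sim$pricerange[val_idx]))
```

Comparing the accuracy with the majority-class baseline helps to judge whether a model is actually informative, because a high accuracy can be reached trivially when one class dominates.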
